The dataset contains 10,000 sample points with 14 distinct features, such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, and Balance.
The detailed data dictionary is given below:
Customer Details
# Importing the Python Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
from IPython.display import Image
# Importing libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter
# To suppress warnings from being displayed in the generated output
import warnings
warnings.filterwarnings("ignore")
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# this will help in making the Python code more structured automatically (good coding practice)
!pip install nb-black
%reload_ext nb_black
# Command to tell Python to actually display the graphs
%matplotlib inline
# let's start by installing plotly
!pip install plotly
# importing plotly
import plotly.express as px
# Command to hide the 'already satisfied' messages from displaying
%pip install keras | grep -v 'already satisfied'
# Constant for making bold text
boldText = "\033[1m"
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 500)
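The two `pd.set_option` calls above change display behaviour globally for the rest of the notebook. As a side note, `pd.option_context` can scope a display option to a single block; a minimal sketch:

```python
import pandas as pd

# pd.option_context applies an option only inside the with-block,
# then restores the previous value automatically on exit
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")  # 5 while the block runs

outside = pd.get_option("display.max_rows")  # back to the prior value
```

This is handy when one cell needs a different display limit without disturbing the global settings chosen above.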
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build a linear regression model
from sklearn.linear_model import LinearRegression
# to build Bagging model
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# to build Boosting model
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
pd.set_option("mode.chained_assignment", None)
# To build model for prediction
from sklearn.linear_model import LogisticRegression
# To get different metric scores
# To tune different models
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
plot_confusion_matrix,
precision_recall_curve,
roc_curve,
make_scorer,
)
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To be used for creating pipelines and personalizing them
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Install library using
# In jupyter notebook
# !pip install shap
# or
# In anaconda command prompt
# conda install -c conda-forge shap
import shap
# Importing the callbacks API
from keras import callbacks
# Importing tensorflow library
import tensorflow as tf
# importing different functions to build models
from tensorflow.keras.layers import Dense, Dropout,InputLayer
from tensorflow.keras.models import Sequential
# Importing Batch Normalization
from keras.layers import BatchNormalization
# Importing backend
from tensorflow.keras import backend
# Importing shuffle
from random import shuffle
from keras.callbacks import ModelCheckpoint
# Importing optimizers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import RMSprop
from sklearn.metrics import classification_report
# Loading Dataset
df = pd.read_csv("../Dataset/Churn.csv")
# setting the seed so the sampling below gives the same results every time
np.random.seed(1)
df.sample(n=10)
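`np.random.seed(1)` makes the `df.sample` call above reproducible via NumPy's global state. An alternative sketch (on a toy frame with made-up values, not the churn data) passes `random_state` directly to `sample`, which pins down the draw without touching the global seed:

```python
import pandas as pd

# Toy frame standing in for the churn data (illustrative values only)
toy = pd.DataFrame({"CustomerId": range(100), "Balance": range(100)})

# random_state pins the draw for this call only, leaving NumPy's
# global random state untouched for other code
s1 = toy.sample(n=10, random_state=1)
s2 = toy.sample(n=10, random_state=1)
assert s1.equals(s2)  # the same 10 rows on every run
```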
# To copy the data to another object
custData = df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   RowNumber        10000 non-null  int64
 1   CustomerId       10000 non-null  int64
 2   Surname          10000 non-null  object
 3   CreditScore      10000 non-null  int64
 4   Geography        10000 non-null  object
 5   Gender           10000 non-null  object
 6   Age              10000 non-null  int64
 7   Tenure           10000 non-null  int64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64
 10  HasCrCard        10000 non-null  int64
 11  IsActiveMember   10000 non-null  int64
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
# Command to understand the total number of data collected
print(
f"- There are {df.shape[0]} row samples and {df.shape[1]} attributes of the customer information collected in this dataset."
)
- There are 10000 row samples and 14 attributes of the customer information collected in this dataset.
df.head(10)  # Displaying the first 10 rows of the Dataset
| RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| 5 | 6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 6 | 7 | 15592531 | Bartlett | 822 | France | Male | 50 | 7 | 0.00 | 2 | 1 | 1 | 10062.80 | 0 |
| 7 | 8 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
| 8 | 9 | 15792365 | He | 501 | France | Male | 44 | 4 | 142051.07 | 2 | 0 | 1 | 74940.50 | 0 |
| 9 | 10 | 15592389 | H? | 684 | France | Male | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |
df.tail(10) # Displaying the last 10 rows of the Dataset
| RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9990 | 9991 | 15798964 | Nkemakonam | 714 | Germany | Male | 33 | 3 | 35016.60 | 1 | 1 | 0 | 53667.08 | 0 |
| 9991 | 9992 | 15769959 | Ajuluchukwu | 597 | France | Female | 53 | 4 | 88381.21 | 1 | 1 | 0 | 69384.71 | 1 |
| 9992 | 9993 | 15657105 | Chukwualuka | 726 | Spain | Male | 36 | 2 | 0.00 | 1 | 1 | 0 | 195192.40 | 0 |
| 9993 | 9994 | 15569266 | Rahman | 644 | France | Male | 28 | 7 | 155060.41 | 1 | 1 | 0 | 29179.52 | 0 |
| 9994 | 9995 | 15719294 | Wood | 800 | France | Female | 29 | 2 | 0.00 | 2 | 0 | 0 | 167773.55 | 0 |
| 9995 | 9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
| 9999 | 10000 | 15628319 | Walker | 792 | France | Female | 28 | 4 | 130142.79 | 1 | 1 | 0 | 38190.78 | 0 |
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RowNumber | 10000.0 | NaN | NaN | NaN | 5000.5 | 2886.89568 | 1.0 | 2500.75 | 5000.5 | 7500.25 | 10000.0 |
| CustomerId | 10000.0 | NaN | NaN | NaN | 15690940.5694 | 71936.186123 | 15565701.0 | 15628528.25 | 15690738.0 | 15753233.75 | 15815690.0 |
| Surname | 10000 | 2932 | Smith | 32 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CreditScore | 10000.0 | NaN | NaN | NaN | 650.5288 | 96.653299 | 350.0 | 584.0 | 652.0 | 718.0 | 850.0 |
| Geography | 10000 | 3 | France | 5014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 10000 | 2 | Male | 5457 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 10000.0 | NaN | NaN | NaN | 38.9218 | 10.487806 | 18.0 | 32.0 | 37.0 | 44.0 | 92.0 |
| Tenure | 10000.0 | NaN | NaN | NaN | 5.0128 | 2.892174 | 0.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| Balance | 10000.0 | NaN | NaN | NaN | 76485.889288 | 62397.405202 | 0.0 | 0.0 | 97198.54 | 127644.24 | 250898.09 |
| NumOfProducts | 10000.0 | NaN | NaN | NaN | 1.5302 | 0.581654 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| HasCrCard | 10000.0 | NaN | NaN | NaN | 0.7055 | 0.45584 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 10000.0 | NaN | NaN | NaN | 0.5151 | 0.499797 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 10000.0 | NaN | NaN | NaN | 100090.239881 | 57510.492818 | 11.58 | 51002.11 | 100193.915 | 149388.2475 | 199992.48 |
| Exited | 10000.0 | NaN | NaN | NaN | 0.2037 | 0.402769 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
Data Description:
- RowNumber & CustomerId - There are 10,000 customer samples in the dataset, with 14 attributes recorded for each customer
- Surname - The last name of the customer. There are around 2,932 unique last names out of 10,000 records, with 32 customers sharing the most common last name, "Smith"
- CreditScore - The credit scores range from 350 to 850, with 650 being roughly both the mean and the median. More than 50% of the customers have a score above 650
- Geography - There are 3 locations being considered, with France having the largest number of customers
- Gender - Male customers are slightly more numerous in the dataset, with 5,457 records
- Age - Customer ages range from 18 to 92, with an average age of 39. Almost 75% of the customers are below the age of 44
- Tenure - The number of years of association with the bank ranges from 0 to 10, with almost 50% of the customers associated for 5+ years
- Balance - Account balances range from 0 to about 250K, with 50% of the customers holding more than 97K
- NumOfProducts - The bank offers 4 products, and almost 50% of the customers hold only one product
- HasCrCard - More than 70% of the customers have a credit card
- IsActiveMember - Almost 50% of the customers actively use their bank account
- EstimatedSalary - 50% of the customers earn more than 100K, and the maximum salary is around 200K
- Exited - This is the target variable for the model. About 80% of the customers are continuing with the bank and about 20% have left

df.nunique()
RowNumber          10000
CustomerId         10000
Surname             2932
CreditScore          460
Geography              3
Gender                 2
Age                   70
Tenure                11
Balance             6382
NumOfProducts          4
HasCrCard              2
IsActiveMember         2
EstimatedSalary     9999
Exited                 2
dtype: int64
Observations:
RowNumber, CustomerId & Surname are unique (or near-unique) for each customer and will not add value to the model
# Dropping the customer ID/information columns since they are not required
df.drop(["RowNumber"], axis=1, inplace=True)
df.drop(["CustomerId"], axis=1, inplace=True)
df.drop(["Surname"], axis=1, inplace=True)
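The three drops above can also be collapsed into a single call; a small sketch on a toy frame (column values are made up for illustration):

```python
import pandas as pd

# Toy frame with the same identifier columns (illustrative values)
toy = pd.DataFrame(
    {"RowNumber": [1], "CustomerId": [2], "Surname": ["x"], "CreditScore": [700]}
)

# columns= drops all three identifiers in one pass and returns a
# new frame, instead of three separate inplace mutations
toy = toy.drop(columns=["RowNumber", "CustomerId", "Surname"])
```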
# Checking for duplicated rows in the dataset
duplicateSum = df.duplicated().sum()
print("**Inferences:**")
if duplicateSum > 0:
    print(f"- There are {duplicateSum} duplicated row(s) in the dataset")
    # Removing the duplicated rows in the dataset
    df.drop_duplicates(inplace=True)
    print(
        f"- There are {df.duplicated().sum()} duplicated row(s) in the dataset post cleaning"
    )
    # resetting the index of the data frame since some rows were removed
    df.reset_index(drop=True, inplace=True)
else:
    print("- There are no duplicated row(s) in the dataset")
**Inferences:** - There are no duplicated row(s) in the dataset
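The duplicate-handling branch above never fires on this dataset; a tiny sketch with a deliberately duplicated row shows the same detect-drop-reset flow:

```python
import pandas as pd

# Toy frame whose second row repeats the first
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
n_dups = toy.duplicated().sum()  # counts repeats after the first occurrence

# Same cleanup path as above: drop duplicates, then reset the index
toy = toy.drop_duplicates().reset_index(drop=True)
```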
df.isnull().sum()
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
Observations:
There are no missing values in any column, so no imputation is required.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CreditScore      10000 non-null  int64
 1   Geography        10000 non-null  object
 2   Gender           10000 non-null  object
 3   Age              10000 non-null  int64
 4   Tenure           10000 non-null  int64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64
 7   HasCrCard        10000 non-null  int64
 8   IsActiveMember   10000 non-null  int64
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 10000.0 | NaN | NaN | NaN | 650.5288 | 96.653299 | 350.0 | 584.0 | 652.0 | 718.0 | 850.0 |
| Geography | 10000 | 3 | France | 5014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 10000 | 2 | Male | 5457 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 10000.0 | NaN | NaN | NaN | 38.9218 | 10.487806 | 18.0 | 32.0 | 37.0 | 44.0 | 92.0 |
| Tenure | 10000.0 | NaN | NaN | NaN | 5.0128 | 2.892174 | 0.0 | 3.0 | 5.0 | 7.0 | 10.0 |
| Balance | 10000.0 | NaN | NaN | NaN | 76485.889288 | 62397.405202 | 0.0 | 0.0 | 97198.54 | 127644.24 | 250898.09 |
| NumOfProducts | 10000.0 | NaN | NaN | NaN | 1.5302 | 0.581654 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| HasCrCard | 10000.0 | NaN | NaN | NaN | 0.7055 | 0.45584 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| IsActiveMember | 10000.0 | NaN | NaN | NaN | 0.5151 | 0.499797 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| EstimatedSalary | 10000.0 | NaN | NaN | NaN | 100090.239881 | 57510.492818 | 11.58 | 51002.11 | 100193.915 | 149388.2475 | 199992.48 |
| Exited | 10000.0 | NaN | NaN | NaN | 0.2037 | 0.402769 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
# printing the number of occurrences of each unique value in each categorical column
num_to_display = 10
for column in df.describe(include="all").columns:
    val_counts = df[column].value_counts(
        dropna=False
    )  # keep dropna=False to count NA values as well
    print("Unique values in", column, "are :")
    print(val_counts.iloc[:num_to_display])
    if len(val_counts) > num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    print("-" * 50)
    print(" ")
Unique values in CreditScore are :
850    233
678     63
655     54
705     53
667     53
684     52
670     50
651     50
683     48
652     48
Name: CreditScore, dtype: int64
Only displaying first 10 of 460 values.
--------------------------------------------------
Unique values in Geography are :
France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64
--------------------------------------------------
Unique values in Gender are :
Male      5457
Female    4543
Name: Gender, dtype: int64
--------------------------------------------------
Unique values in Age are :
37    478
38    477
35    474
36    456
34    447
33    442
40    432
39    423
32    418
31    404
Name: Age, dtype: int64
Only displaying first 10 of 70 values.
--------------------------------------------------
Unique values in Tenure are :
2     1048
1     1035
7     1028
8     1025
5     1012
3     1009
4      989
9      984
6      967
10     490
Name: Tenure, dtype: int64
Only displaying first 10 of 11 values.
--------------------------------------------------
Unique values in Balance are :
0.00         3617
130170.82       2
105473.74       2
85304.27        1
159397.75       1
144238.70       1
112262.84       1
109106.80       1
142147.32       1
109109.33       1
Name: Balance, dtype: int64
Only displaying first 10 of 6382 values.
--------------------------------------------------
Unique values in NumOfProducts are :
1    5084
2    4590
3     266
4      60
Name: NumOfProducts, dtype: int64
--------------------------------------------------
Unique values in HasCrCard are :
1    7055
0    2945
Name: HasCrCard, dtype: int64
--------------------------------------------------
Unique values in IsActiveMember are :
1    5151
0    4849
Name: IsActiveMember, dtype: int64
--------------------------------------------------
Unique values in EstimatedSalary are :
24924.92     2
101348.88    1
55313.44     1
72500.68     1
182692.80    1
4993.94      1
124964.82    1
161971.42    1
39488.04     1
187811.71    1
Name: EstimatedSalary, dtype: int64
Only displaying first 10 of 9999 values.
--------------------------------------------------
Unique values in Exited are :
0    7963
1    2037
Name: Exited, dtype: int64
--------------------------------------------------
Observations:
- CreditScore - Around 233 customers have a score of 850, and the credit score is otherwise fairly uniformly distributed
- Geography - There are 3 locations being considered: France, Germany & Spain, with France having the largest number of customers
- Gender - Male customers are slightly more numerous in the dataset, with 5,457 records
- Age - Customers in the 30-40 range form the largest group of users
- Tenure - Customers with 1-2 years of association with the bank are the most numerous, followed by 7-8 years
- Balance - The data contains a lot of zero-balance customers, which needs to be treated
- NumOfProducts - The bank offers 4 products; most customers hold 1 product, followed by 2, then 3 and 4
- HasCrCard - More than 70% of the customers have a credit card
- IsActiveMember - Almost 50% of the customers actively use their bank account
- Exited - This is the target variable for the model. About 80% of the customers have not exited and are continuing with the bank

Inferences:
df["Geography"] = df["Geography"].astype("category")
df["Gender"] = df["Gender"].astype("category")
df["Tenure"] = df["Tenure"].astype("category")
df["NumOfProducts"] = df["NumOfProducts"].astype("category")
df["HasCrCard"] = df["HasCrCard"].astype("category")
df["IsActiveMember"] = df["IsActiveMember"].astype("category")
df["Exited"] = df["Exited"].astype("category")
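Beyond signalling that these columns are categorical, the `category` conversions above also shrink memory, since each distinct label is stored once with small integer codes. A sketch with a toy Geography-like column:

```python
import pandas as pd

# A low-cardinality string column like Geography (toy values)
geo = pd.Series(["France", "Spain", "Germany"] * 1000)

as_object = geo.memory_usage(deep=True)  # full string stored per row
as_category = geo.astype("category").memory_usage(deep=True)
# category stores the 3 labels once plus a small integer code per
# row, so it is much smaller when labels repeat heavily
```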
# Defining bins for splitting the age to groups and creating a new column to review the relationship
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = [
"Less_than_20",
"Less_than_30",
"Less_than_40",
"Less_than_50",
"Less_than_60",
"Less_than_70",
"Less_than_80",
"Less_than_90",
"Less_than_100",
]
df["Age_Grp"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
df["Age_Grp"] = df["Age_Grp"].astype("category")
df["Age_Grp"].value_counts(dropna=False)
Less_than_40     4346
Less_than_50     2618
Less_than_30     1592
Less_than_60      869
Less_than_70      375
Less_than_80      136
Less_than_20       49
Less_than_90       13
Less_than_100       2
Name: Age_Grp, dtype: int64
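`right=False` in the `pd.cut` call above makes each bin closed on the left and open on the right, so a boundary value falls into the higher bin. A small check with toy ages:

```python
import pandas as pd

# With right=False the bins are [10, 20), [20, 30), [30, 40):
# a customer aged exactly 20 lands in the twenties bin, not the teens
ages = pd.Series([19, 20, 29, 30])
groups = pd.cut(
    ages,
    bins=[10, 20, 30, 40],
    right=False,
    labels=["teens", "twenties", "thirties"],
)
```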
# Defining bins for credit score and creating a new column to review the relationship
bins = [300, 400, 500, 600, 700, 800, 900]
labels = [
"CS_300_400",
"CS_400_500",
"CS_500_600",
"CS_600_700",
"CS_700_800",
"CS_800_900",
]
df["CreditScore_Grp"] = pd.cut(df["CreditScore"], bins=bins, labels=labels, right=False)
df["CreditScore_Grp"] = df["CreditScore_Grp"].astype("category")
df["CreditScore_Grp"].value_counts(dropna=False)
CS_600_700    3818
CS_700_800    2493
CS_500_600    2402
CS_800_900     655
CS_400_500     613
CS_300_400      19
Name: CreditScore_Grp, dtype: int64
# Defining bins for Salary and creating a new column to review the relationship
bins = [0, 50000, 100000, 150000, 200000]
labels = [
"Lessthan_50K",
"Between_50K-100K",
"Between_100K-150K",
"Between_150K-200K",
]
df["EstimatedSalary_Grp"] = pd.cut(
df["EstimatedSalary"], bins=bins, labels=labels, right=False
)
df["EstimatedSalary_Grp"] = df["EstimatedSalary_Grp"].astype("category")
df["EstimatedSalary_Grp"].value_counts(dropna=False)
Between_100K-150K    2555
Between_50K-100K     2537
Between_150K-200K    2455
Lessthan_50K         2453
Name: EstimatedSalary_Grp, dtype: int64
# Observing the data dictionary after the changes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   CreditScore          10000 non-null  int64
 1   Geography            10000 non-null  category
 2   Gender               10000 non-null  category
 3   Age                  10000 non-null  int64
 4   Tenure               10000 non-null  category
 5   Balance              10000 non-null  float64
 6   NumOfProducts        10000 non-null  category
 7   HasCrCard            10000 non-null  category
 8   IsActiveMember       10000 non-null  category
 9   EstimatedSalary      10000 non-null  float64
 10  Exited               10000 non-null  category
 11  Age_Grp              10000 non-null  category
 12  CreditScore_Grp      10000 non-null  category
 13  EstimatedSalary_Grp  10000 non-null  category
dtypes: category(10), float64(2), int64(2)
memory usage: 412.2 KB
# Command to understand the total number of data collected
print(
f"- There are {df.shape[0]} row samples and {df.shape[1]} attributes of the customer information collected in this dataset."
)
- There are 10000 row samples and 14 attributes of the customer information collected in this dataset.
# Identifying the category columns
category_columnNames = df.describe(include=["category"]).columns
category_columnNames
Index(['Geography', 'Gender', 'Tenure', 'NumOfProducts', 'HasCrCard',
'IsActiveMember', 'Exited', 'Age_Grp', 'CreditScore_Grp',
'EstimatedSalary_Grp'],
dtype='object')
# Identifying the numerical columns
number_columnNames = (
df.describe(include=["int64"]).columns.tolist()
+ df.describe(include=["float64"]).columns.tolist()
)
number_columnNames
['CreditScore', 'Age', 'Balance', 'EstimatedSalary']
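Concatenating the `int64` and `float64` columns from `describe()` works, but `select_dtypes` is a one-step alternative; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame mixing numeric and categorical columns (illustrative)
toy = pd.DataFrame(
    {
        "CreditScore": [650],
        "Balance": [0.0],
        "Geography": pd.Series(["France"], dtype="category"),
    }
)

# include="number" matches every numeric dtype (int64, float64, ...)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
```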
df.describe(include="category").T
| count | unique | top | freq | |
|---|---|---|---|---|
| Geography | 10000 | 3 | France | 5014 |
| Gender | 10000 | 2 | Male | 5457 |
| Tenure | 10000 | 11 | 2 | 1048 |
| NumOfProducts | 10000 | 4 | 1 | 5084 |
| HasCrCard | 10000 | 2 | 1 | 7055 |
| IsActiveMember | 10000 | 2 | 1 | 5151 |
| Exited | 10000 | 2 | 0 | 7963 |
| Age_Grp | 10000 | 9 | Less_than_40 | 4346 |
| CreditScore_Grp | 10000 | 6 | CS_600_700 | 3818 |
| EstimatedSalary_Grp | 10000 | 4 | Between_100K-150K | 2555 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CreditScore | 10000.0 | 650.528800 | 96.653299 | 350.00 | 584.00 | 652.000 | 718.0000 | 850.00 |
| Age | 10000.0 | 38.921800 | 10.487806 | 18.00 | 32.00 | 37.000 | 44.0000 | 92.00 |
| Balance | 10000.0 | 76485.889288 | 62397.405202 | 0.00 | 0.00 | 97198.540 | 127644.2400 | 250898.09 |
| EstimatedSalary | 10000.0 | 100090.239881 | 57510.492818 | 11.58 | 51002.11 | 100193.915 | 149388.2475 | 199992.48 |
Data Structure:
Data Cleaning:
RowNumber, CustomerId & Surname attributes are not required, so these columns were dropped
Data Insight:
For more details, refer to the comments in the Data Description & Feature Value observations
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None, hueCol=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    hueCol: optional column used to colour the bars (default is None)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 7))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        hue=hueCol,
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage above each bar
    plt.show()  # show the plot
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print("-" * 30, " Volume ", "-" * 30)
    print(tab1)
    tab1 = pd.crosstab(
        data[predictor], data[target], margins=True, normalize="index"
    ).sort_values(by=sorter, ascending=False)
    print("-" * 30, " Percentage % ", "-" * 30)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # a second legend call would override the first, so a single call suffices
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
# Creating a common function to draw a Boxplot & a Histogram for each of the analysis
def histogram_boxplot(data, feature, figsize=(15, 7), kde=True, bins=None):
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star indicates the mean value of the column
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        )
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # for the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
# functions to treat outliers by flooring and capping
def treat_outliers(df, col, lower=0.25, upper=0.75, mul=1.5):
    """
    Treats outliers in a variable

    df: dataframe
    col: dataframe column
    lower/upper: quantiles used to compute the whiskers (default 0.25/0.75)
    mul: IQR multiplier for the whiskers (default 1.5)
    """
    Q1 = df[col].quantile(lower)  # 25th percentile
    Q3 = df[col].quantile(upper)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - (mul * IQR)
    Upper_Whisker = Q3 + (mul * IQR)
    # values below Lower_Whisker are floored to Lower_Whisker;
    # values above Upper_Whisker are capped at Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df
def treat_outliers_all(df, col_list, lower=0.25, upper=0.75, mul=1.5):
    """
    Treat outliers in a list of variables

    df: dataframe
    col_list: list of dataframe columns
    """
    for c in col_list:
        df = treat_outliers(df, c, lower, upper, mul)
    return df
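The flooring-and-capping logic in `treat_outliers` can be checked on a toy column with one obvious outlier (the values below are made up for illustration):

```python
import numpy as np
import pandas as pd

# One column with a clear high outlier
col = pd.Series([10, 12, 11, 13, 12, 100])

q1, q3 = col.quantile(0.25), col.quantile(0.75)  # 11.25 and 12.75 here
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # 9.0 and 15.0

# Values outside the whiskers are clipped to the whisker values,
# not dropped, exactly as treat_outliers does above
clipped = np.clip(col, lower, upper)
```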
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   CreditScore          10000 non-null  int64
 1   Geography            10000 non-null  category
 2   Gender               10000 non-null  category
 3   Age                  10000 non-null  int64
 4   Tenure               10000 non-null  category
 5   Balance              10000 non-null  float64
 6   NumOfProducts        10000 non-null  category
 7   HasCrCard            10000 non-null  category
 8   IsActiveMember       10000 non-null  category
 9   EstimatedSalary      10000 non-null  float64
 10  Exited               10000 non-null  category
 11  Age_Grp              10000 non-null  category
 12  CreditScore_Grp      10000 non-null  category
 13  EstimatedSalary_Grp  10000 non-null  category
dtypes: category(10), float64(2), int64(2)
memory usage: 412.2 KB
# Data Description of Categorical variables
df.describe(include="category").T
| count | unique | top | freq | |
|---|---|---|---|---|
| Geography | 10000 | 3 | France | 5014 |
| Gender | 10000 | 2 | Male | 5457 |
| Tenure | 10000 | 11 | 2 | 1048 |
| NumOfProducts | 10000 | 4 | 1 | 5084 |
| HasCrCard | 10000 | 2 | 1 | 7055 |
| IsActiveMember | 10000 | 2 | 1 | 5151 |
| Exited | 10000 | 2 | 0 | 7963 |
| Age_Grp | 10000 | 9 | Less_than_40 | 4346 |
| CreditScore_Grp | 10000 | 6 | CS_600_700 | 3818 |
| EstimatedSalary_Grp | 10000 | 4 | Between_100K-150K | 2555 |
# Data Description of Numerical variables
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CreditScore | 10000.0 | 650.528800 | 96.653299 | 350.00 | 584.00 | 652.000 | 718.0000 | 850.00 |
| Age | 10000.0 | 38.921800 | 10.487806 | 18.00 | 32.00 | 37.000 | 44.0000 | 92.00 |
| Balance | 10000.0 | 76485.889288 | 62397.405202 | 0.00 | 0.00 | 97198.540 | 127644.2400 | 250898.09 |
| EstimatedSalary | 10000.0 | 100090.239881 | 57510.492818 | 11.58 | 51002.11 | 100193.915 | 149388.2475 | 199992.48 |
# Summary of data
df.describe(include="all").T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 10000.0 | NaN | NaN | NaN | 650.5288 | 96.653299 | 350.0 | 584.0 | 652.0 | 718.0 | 850.0 |
| Geography | 10000 | 3 | France | 5014 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Gender | 10000 | 2 | Male | 5457 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age | 10000.0 | NaN | NaN | NaN | 38.9218 | 10.487806 | 18.0 | 32.0 | 37.0 | 44.0 | 92.0 |
| Tenure | 10000.0 | 11.0 | 2.0 | 1048.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Balance | 10000.0 | NaN | NaN | NaN | 76485.889288 | 62397.405202 | 0.0 | 0.0 | 97198.54 | 127644.24 | 250898.09 |
| NumOfProducts | 10000.0 | 4.0 | 1.0 | 5084.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| HasCrCard | 10000.0 | 2.0 | 1.0 | 7055.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| IsActiveMember | 10000.0 | 2.0 | 1.0 | 5151.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| EstimatedSalary | 10000.0 | NaN | NaN | NaN | 100090.239881 | 57510.492818 | 11.58 | 51002.11 | 100193.915 | 149388.2475 | 199992.48 |
| Exited | 10000.0 | 2.0 | 0.0 | 7963.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Age_Grp | 10000 | 9 | Less_than_40 | 4346 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CreditScore_Grp | 10000 | 6 | CS_600_700 | 3818 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| EstimatedSalary_Grp | 10000 | 4 | Between_100K-150K | 2555 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# printing the number of occurrences of each unique value in each categorical column
num_to_display = 15
for column in category_columnNames:
    val_counts = df[column].value_counts(
        dropna=False
    )  # kept dropna=False to see the NA value count as well
    print("Unique values in", column, "are :")
    print(val_counts.iloc[:num_to_display])
    if len(val_counts) > num_to_display:
        print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
    labeled_barplot(df, column, perc=True, n=5)
    plt.tight_layout()
    print("-" * 50)
    print(" ")
Unique values in Geography are :
France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64
--------------------------------------------------
Unique values in Gender are :
Male      5457
Female    4543
Name: Gender, dtype: int64
--------------------------------------------------
Unique values in Tenure are :
2     1048
1     1035
7     1028
8     1025
5     1012
3     1009
4      989
9      984
6      967
10     490
0      413
Name: Tenure, dtype: int64
--------------------------------------------------
Unique values in NumOfProducts are :
1    5084
2    4590
3     266
4      60
Name: NumOfProducts, dtype: int64
--------------------------------------------------
Unique values in HasCrCard are :
1    7055
0    2945
Name: HasCrCard, dtype: int64
--------------------------------------------------
Unique values in IsActiveMember are :
1    5151
0    4849
Name: IsActiveMember, dtype: int64
--------------------------------------------------
Unique values in Exited are :
0    7963
1    2037
Name: Exited, dtype: int64
--------------------------------------------------
Unique values in Age_Grp are :
Less_than_40     4346
Less_than_50     2618
Less_than_30     1592
Less_than_60      869
Less_than_70      375
Less_than_80      136
Less_than_20       49
Less_than_90       13
Less_than_100       2
Name: Age_Grp, dtype: int64
--------------------------------------------------
Unique values in CreditScore_Grp are :
CS_600_700    3818
CS_700_800    2493
CS_500_600    2402
CS_800_900     655
CS_400_500     613
CS_300_400      19
Name: CreditScore_Grp, dtype: int64
--------------------------------------------------
Unique values in EstimatedSalary_Grp are :
Between_100K-150K    2555
Between_50K-100K     2537
Between_150K-200K    2455
Lessthan_50K         2453
Name: EstimatedSalary_Grp, dtype: int64
--------------------------------------------------
Observations:
Age - 43.5% of the customers fall within the 30-40 age range, followed by the 40-50 range
Credit Score - 38% of the customers have scores in the 600-700 range, followed by the 700-800 and 500-600 ranges respectively
Salary - Customers are almost uniformly distributed, with roughly 25% in each of the four salary bands (under 50K, 50K-100K, 100K-150K, 150K-200K)
Geography - 50% of the customers are from France, 25% from Germany and 25% from Spain
Gender - 55% of the customers are male
Tenure - Tenure is roughly uniformly distributed, with about 10% of customers at each tenure from 1 to 9 years; tenures of 0 and 10 years are less common
NumOfProducts - The bank offers 4 products; 51% of customers have taken 1 product and 46% have taken 2
HasCrCard - 71% of the customers have a credit card
IsActiveMember - Almost 52% of the customers actively use their bank account
Exited - Almost 80% of the customers are still with the bank; about 20% have churned
# creating histograms
df.hist(figsize=(14, 14))
plt.show()
# Summary of numeric data
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CreditScore | 10000.0 | 650.528800 | 96.653299 | 350.00 | 584.00 | 652.000 | 718.0000 | 850.00 |
| Age | 10000.0 | 38.921800 | 10.487806 | 18.00 | 32.00 | 37.000 | 44.0000 | 92.00 |
| Balance | 10000.0 | 76485.889288 | 62397.405202 | 0.00 | 0.00 | 97198.540 | 127644.2400 | 250898.09 |
| EstimatedSalary | 10000.0 | 100090.239881 | 57510.492818 | 11.58 | 51002.11 | 100193.915 | 149388.2475 | 199992.48 |
Observations:
histogram_boxplot(df, "Balance")
# Replacing the Zero values with NAN
df["Balance"] = df["Balance"].replace(0, np.nan)
histogram_boxplot(df, "Balance")
df.groupby(["Age_Grp", "EstimatedSalary_Grp"])["Balance"].mean()
Age_Grp EstimatedSalary_Grp
Less_than_20 Lessthan_50K 113819.450909
Between_50K-100K 82767.420000
Between_100K-150K 133052.409091
Between_150K-200K 120991.620000
Less_than_30 Lessthan_50K 120279.868577
Between_50K-100K 120107.320491
Between_100K-150K 124012.452706
Between_150K-200K 120075.041811
Less_than_40 Lessthan_50K 119858.643436
Between_50K-100K 121777.056971
Between_100K-150K 117851.005694
Between_150K-200K 119362.658278
Less_than_50 Lessthan_50K 121623.465299
Between_50K-100K 117248.360970
Between_100K-150K 119485.156864
Between_150K-200K 119685.754170
Less_than_60 Lessthan_50K 120918.262353
Between_50K-100K 117758.934691
Between_100K-150K 123290.101656
Between_150K-200K 118581.151871
Less_than_70 Lessthan_50K 115276.973846
Between_50K-100K 117469.800000
Between_100K-150K 118625.630968
Between_150K-200K 126847.044667
Less_than_80 Lessthan_50K 109887.456800
Between_50K-100K 116123.040000
Between_100K-150K 104766.251500
Between_150K-200K 128955.719375
Less_than_90 Lessthan_50K NaN
Between_50K-100K 122692.890000
Between_100K-150K NaN
Between_150K-200K 90057.865000
Less_than_100 Lessthan_50K 126076.240000
Between_50K-100K NaN
Between_100K-150K NaN
Between_150K-200K 121513.310000
Name: Balance, dtype: float64
# Replacing NaN based on the median values of the grouping of Age & EstimatedSalary
df["Balance"] = df["Balance"].fillna(
    df.groupby(["Age", "EstimatedSalary"])["Balance"].transform("median")
)
# Replacing NaN based on the median values of the grouping of Age_Grp & EstimatedSalary_Grp
df["Balance"] = df["Balance"].fillna(
    df.groupby(["Age_Grp", "EstimatedSalary_Grp"])["Balance"].transform("median")
)
# Replacing remaining NaN with 0's
df["Balance"] = df["Balance"].fillna(0)
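The imputation above is a chain of fallbacks: fill from a grouping first, then a constant for any group that had no observed values at all. Two of the three stages can be seen on a toy frame (the data is illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Balance imputation chain
toy = pd.DataFrame({
    "Age_Grp": ["A", "A", "A", "B", "B"],
    "Balance": [100.0, np.nan, 300.0, np.nan, np.nan],
})

# 1) fill from the group median where the group has observed values
toy["Balance"] = toy["Balance"].fillna(
    toy.groupby("Age_Grp")["Balance"].transform("median")
)
# 2) group "B" has no observed Balance at all, so its median is NaN; fall back to 0
toy["Balance"] = toy["Balance"].fillna(0)
print(toy["Balance"].tolist())  # [100.0, 200.0, 300.0, 0.0, 0.0]
```

Because `transform("median")` broadcasts the group median back to every row, `fillna` only touches the rows that were missing.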
df[df["Balance"].isna()]
| CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Age_Grp | CreditScore_Grp | EstimatedSalary_Grp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
histogram_boxplot(df, "Balance")
histogram_boxplot(df, "CreditScore")
Observations:
histogram_boxplot(df, "Age")
Observations:
histogram_boxplot(df, "EstimatedSalary")
Observations:
# Displaying the distribution of each categorical variable against the target (Exited)
for i, cols in zip(range(len(category_columnNames)), category_columnNames):
    count = df[cols].nunique()
    sorter = df["Exited"].value_counts(dropna=False).index[-1]
    tab1 = pd.crosstab(df[cols], df["Exited"], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print("-" * 30, " Volume ", "-" * 30)
    print(tab1)
    tab1 = pd.crosstab(
        df[cols], df["Exited"], margins=True, normalize="index"
    ).sort_values(by=sorter, ascending=False)
    print("-" * 30, " Percentage % ", "-" * 30)
    print(tab1)
    print("-" * 120)
    labeled_barplot(df, cols, perc=True, n=10, hueCol="Exited")
    plt.tight_layout()
------------------------------  Volume  ------------------------------
Exited        0     1    All
Geography
All        7963  2037  10000
Germany    1695   814   2509
France     4204   810   5014
Spain      2064   413   2477
------------------------------  Percentage %  ------------------------------
Exited            0         1
Geography
Germany    0.675568  0.324432
All        0.796300  0.203700
Spain      0.833266  0.166734
France     0.838452  0.161548
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited     0     1    All
Gender
All     7963  2037  10000
Female  3404  1139   4543
Male    4559   898   5457
------------------------------  Percentage %  ------------------------------
Exited         0         1
Gender
Female  0.749285  0.250715
All     0.796300  0.203700
Male    0.835441  0.164559
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited     0     1    All
Tenure
All     7963  2037  10000
1        803   232   1035
3        796   213   1009
9        771   213    984
5        803   209   1012
4        786   203    989
2        847   201   1048
8        828   197   1025
6        771   196    967
7        851   177   1028
10       389   101    490
0        318    95    413
------------------------------  Percentage %  ------------------------------
Exited         0         1
Tenure
0       0.769976  0.230024
1       0.775845  0.224155
9       0.783537  0.216463
3       0.788900  0.211100
5       0.793478  0.206522
10      0.793878  0.206122
4       0.794742  0.205258
All     0.796300  0.203700
6       0.797311  0.202689
8       0.807805  0.192195
2       0.808206  0.191794
7       0.827821  0.172179
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited            0     1    All
NumOfProducts
All            7963  2037  10000
1              3675  1409   5084
2              4242   348   4590
3                46   220    266
4                 0    60     60
------------------------------  Percentage %  ------------------------------
Exited                0         1
NumOfProducts
4              0.000000  1.000000
3              0.172932  0.827068
1              0.722856  0.277144
All            0.796300  0.203700
2              0.924183  0.075817
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited        0     1    All
HasCrCard
All        7963  2037  10000
1          5631  1424   7055
0          2332   613   2945
------------------------------  Percentage %  ------------------------------
Exited            0         1
HasCrCard
0          0.791851  0.208149
All        0.796300  0.203700
1          0.798157  0.201843
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited            0     1    All
IsActiveMember
All            7963  2037  10000
0              3547  1302   4849
1              4416   735   5151
------------------------------  Percentage %  ------------------------------
Exited                0         1
IsActiveMember
0              0.731491  0.268509
All            0.796300  0.203700
1              0.857309  0.142691
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited     0     1    All
Exited
1          0  2037   2037
All     7963  2037  10000
0       7963     0   7963
------------------------------  Percentage %  ------------------------------
Exited       0       1
Exited
1       0.0000  1.0000
All     0.7963  0.2037
0       1.0000  0.0000
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited            0     1    All
Age_Grp
All            7963  2037  10000
Less_than_50   1812   806   2618
Less_than_60    382   487    869
Less_than_40   3873   473   4346
Less_than_70    243   132    375
Less_than_30   1471   121   1592
Less_than_80    122    14    136
Less_than_20     46     3     49
Less_than_90     12     1     13
Less_than_100     2     0      2
------------------------------  Percentage %  ------------------------------
Exited                0         1
Age_Grp
Less_than_60   0.439586  0.560414
Less_than_70   0.648000  0.352000
Less_than_50   0.692131  0.307869
All            0.796300  0.203700
Less_than_40   0.891164  0.108836
Less_than_80   0.897059  0.102941
Less_than_90   0.923077  0.076923
Less_than_30   0.923995  0.076005
Less_than_20   0.938776  0.061224
Less_than_100  1.000000  0.000000
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited              0     1    All
CreditScore_Grp
All              7963  2037  10000
CS_600_700       3065   753   3818
CS_500_600       1892   510   2402
CS_700_800       1997   496   2493
CS_400_500        482   131    613
CS_800_900        527   128    655
CS_300_400          0    19     19
------------------------------  Percentage %  ------------------------------
Exited                  0         1
CreditScore_Grp
CS_300_400       0.000000  1.000000
CS_400_500       0.786297  0.213703
CS_500_600       0.787677  0.212323
All              0.796300  0.203700
CS_700_800       0.801043  0.198957
CS_600_700       0.802776  0.197224
CS_800_900       0.804580  0.195420
------------------------------------------------------------------------------------------------------------------------
------------------------------  Volume  ------------------------------
Exited                  0     1    All
EstimatedSalary_Grp
All                  7963  2037  10000
Between_150K-200K    1928   527   2455
Between_100K-150K    2038   517   2555
Between_50K-100K     2033   504   2537
Lessthan_50K         1964   489   2453
------------------------------  Percentage %  ------------------------------
Exited                      0         1
EstimatedSalary_Grp
Between_150K-200K    0.785336  0.214664
All                  0.796300  0.203700
Between_100K-150K    0.797652  0.202348
Lessthan_50K         0.800652  0.199348
Between_50K-100K     0.801340  0.198660
------------------------------------------------------------------------------------------------------------------------
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
fig.suptitle("CDF plot of Categorical variables", fontsize=20)
counter = 0
for ii in range(2):
    sns.ecdfplot(ax=axes[ii][0], x=df[category_columnNames[counter]])
    counter = counter + 1
    if counter != 2:
        sns.ecdfplot(ax=axes[ii][1], x=df[category_columnNames[counter]])
        counter = counter + 1
Exited vs Geography
Exited vs Gender
Exited vs Tenure
Exited vs NumOfProducts
Exited vs HasCrCard
Exited vs IsActiveMember
Exited vs Age Group
Exited vs EstimatedSalary_Grp
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(2):
    sns.ecdfplot(ax=axes[ii][0], x=df[number_columnNames[counter]])
    counter = counter + 1
    if counter != 2:
        sns.ecdfplot(ax=axes[ii][1], x=df[number_columnNames[counter]])
        counter = counter + 1
plt.figure(figsize=(15, 25))
for i, variable in enumerate(number_columnNames):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x=df["Exited"], y=df[variable], palette="PuBu", showfliers=False)
    plt.tight_layout()
    plt.title(variable)
plt.show()
plt.figure(figsize=(15, 20))
for i, variable in enumerate(number_columnNames):
    plt.subplot(5, 2, i + 1)
    sns.lineplot(x=variable, y="Exited", data=df)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Exited vs Credit Score
Exited vs Age
Exited vs Balance
Exited vs Estimated Salary
# Plotting Heatmap by creating a 2-D Matrix with correlation plots
correlation = df.corr()
plt.figure(figsize=(15, 7))
sns.heatmap(correlation, vmin=-1, vmax=1, annot=True, cmap="Spectral")
sns.pairplot(df, corner=True, hue="Exited")
Age_Grp vs EstimatedSalary_Grp
tab = pd.crosstab(df["Age_Grp"], df["EstimatedSalary_Grp"], normalize="index")
tab.plot(kind="bar", stacked=True)
plt.show()
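`pd.crosstab` with `normalize="index"` makes each row sum to 1, which is what lets the stacked bars show within-group proportions. A toy example (the data is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Age_Grp": ["Young", "Young", "Young", "Old", "Old"],
    "Salary_Grp": ["Low", "Low", "High", "High", "High"],
})
tab = pd.crosstab(toy["Age_Grp"], toy["Salary_Grp"], normalize="index")
print(tab)
# Each row sums to 1 (up to floating point), so the stacked bars
# compare the salary mix within each age group, not raw counts
print(tab.sum(axis=1))
```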
Observations:
Age Group vs Geography
tab = pd.crosstab(df["Age_Grp"], df["Geography"], normalize="index")
tab.plot(kind="bar", stacked=True)
plt.show()
Observations:
Balance vs Number of Products vs Exited
plt.figure(figsize=(15, 7))
sns.boxplot(x="Balance", y="NumOfProducts", data=df, hue="Exited")
plt.show()
Observations:
HasCrCard vs Balance vs Exited
plt.figure(figsize=(15, 7))
sns.boxplot(x="HasCrCard", y="Balance", data=df, hue="Exited")
plt.show()
Observations:
Age Group vs Balance vs Exited
plt.figure(figsize=(15, 7))
sns.boxplot(x="Age_Grp", y="Balance", data=df, hue="Exited")
plt.show()
Observations:
CreditScore_Grp vs Tenure vs Balance vs Exited
g = sns.FacetGrid(
    df, col="CreditScore_Grp", hue="Exited", col_wrap=4, margin_titles=True
)
g.map(sns.scatterplot, "Tenure", "Balance")
g.add_legend()
Observations:
NumOfProducts vs EstimatedSalary vs Age vs Exited
g = sns.FacetGrid(df, col="NumOfProducts", hue="Exited", col_wrap=4, margin_titles=True)
g.map(sns.scatterplot, "EstimatedSalary", "Age")
g.add_legend()
Observations:
NumOfProducts vs IsActiveMember vs Age vs Exited
g = sns.FacetGrid(df, col="NumOfProducts", hue="Exited", col_wrap=4, margin_titles=True)
g.map(sns.scatterplot, "IsActiveMember", "Age")
g.add_legend()
Observations:
CreditScore vs NumOfProducts vs Gender
plt.figure(figsize=(15, 7))
sns.boxplot(x="CreditScore", y="NumOfProducts", data=df, hue="Gender")
plt.show()
Observations
# defining a function to compute different metrics to check the performance of a classification model
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score


def model_performance_classification_sklearn_with_threshold(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics, based on the threshold specified, to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    # predicting using the independent variables
    pred_prob = model.predict(predictors)
    pred = (pred_prob > threshold).astype(int)  # apply the classification threshold

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
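The helper above only needs an object whose `predict` returns class-1 probabilities, as the Keras models built later do. A minimal sketch with a mock model (the `MockProbModel` class and its numbers are invented for illustration) shows how the threshold trades precision against recall:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# A stand-in for a fitted model whose predict() returns class-1 probabilities
class MockProbModel:
    def predict(self, predictors):
        return np.array([0.2, 0.6, 0.4, 0.8, 0.9])

target = np.array([0, 0, 1, 1, 1])
model_demo = MockProbModel()

# Lowering the threshold converts more observations to class 1,
# which raises recall at the cost of precision
for threshold in (0.5, 0.3):
    pred = (model_demo.predict(None) > threshold).astype(int)
    print(threshold, recall_score(target, pred), precision_score(target, pred))
```

For a churn problem where the cost of missing a churner is high, sweeping the threshold this way is a cheap alternative to retraining.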
def make_confusion_matrix(
    cf,
    group_names=None,
    categories="auto",
    count=True,
    percent=True,
    cbar=True,
    xyticks=True,
    xyplotlabels=True,
    sum_stats=True,
    figsize=None,
    cmap="Blues",
    title=None,
):
    """
    This function will make a pretty plot of an sklearn confusion matrix cf using a Seaborn heatmap visualization.
    """
    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ["" for i in range(cf.size)]
    if group_names and len(group_names) == cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks
    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks
    if percent:
        group_percentages = [
            "{0:.2%}".format(value) for value in cf.flatten() / np.sum(cf)
        ]
    else:
        group_percentages = blanks
    box_labels = [
        f"{v1}{v2}{v3}".strip()
        for v1, v2, v3 in zip(group_labels, group_counts, group_percentages)
    ]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0], cf.shape[1])

    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        # Accuracy is sum of diagonal divided by total observations
        accuracy = np.trace(cf) / float(np.sum(cf))
        # if it is a binary confusion matrix, show some more stats
        if len(cf) == 2:
            # Metrics for binary confusion matrices
            precision = cf[1, 1] / sum(cf[:, 1])
            recall = cf[1, 1] / sum(cf[1, :])
            f1_score = 2 * precision * recall / (precision + recall)
            stats_text = "\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nF1 Score={:0.3f}".format(
                accuracy, precision, recall, f1_score
            )
        else:
            stats_text = "\n\nAccuracy={:0.3f}".format(accuracy)
    else:
        stats_text = ""

    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize is None:
        # Get default figure size if not set
        figsize = plt.rcParams.get("figure.figsize")
    if xyticks is False:
        # Do not show categories if xyticks is False
        categories = False

    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(
        cf,
        annot=box_labels,
        fmt="",
        cmap=cmap,
        cbar=cbar,
        xticklabels=categories,
        yticklabels=categories,
    )
    if xyplotlabels:
        plt.ylabel("True label")
        plt.xlabel("Predicted label" + stats_text)
    else:
        plt.xlabel(stats_text)
    if title:
        plt.title(title)
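The summary statistics inside `make_confusion_matrix` come straight from the 2x2 matrix, so the arithmetic can be checked in isolation (the matrix below is made-up toy data, with rows as true labels and columns as predicted labels):

```python
import numpy as np

cf = np.array([[1500, 100],
               [ 200, 200]])

accuracy = np.trace(cf) / cf.sum()      # (TN + TP) / total
precision = cf[1, 1] / cf[:, 1].sum()   # TP / (TP + FP)
recall = cf[1, 1] / cf[1, :].sum()      # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)  # 0.85, ~0.667, 0.5, ~0.571
```

Note how accuracy (0.85) looks healthy even though recall is only 0.5; on an imbalanced churn dataset the recall row of the matrix carries most of the information.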
def validate(model, history, testFlag=False):
    # Capturing learning history per epoch
    plt.plot(history.history["recall"])
    plt.plot(history.history["val_recall"])
    plt.title("Recall vs Epochs")
    plt.ylabel("Recall")
    plt.xlabel("Epoch")
    plt.legend(["Train", "Validation"], loc="lower right")
    plt.show()

    model.evaluate(X_train, y_train, verbose=1)
    train_pred = np.round(model.predict(X_train))
    model.evaluate(X_val, y_val, verbose=1)
    val_pred = np.round(model.predict(X_val))

    labels = ["True Negative", "False Positive", "False Negative", "True Positive"]
    cm2 = confusion_matrix(y_train, train_pred)
    make_confusion_matrix(cm2, group_names=labels, cmap="Blues")
    cm2 = confusion_matrix(y_val, val_pred)
    make_confusion_matrix(cm2, group_names=labels, cmap="Blues")
    print("Training Recall Score: ", recall_score(y_train, train_pred))
    print("Validation Recall Score: ", recall_score(y_val, val_pred))

    if testFlag:
        model.evaluate(X_test, y_test, verbose=1)
        test_pred = np.round(model.predict(X_test))
        print("Test Recall Score: ", recall_score(y_test, test_pred))
        cm2 = confusion_matrix(y_test, test_pred)
        make_confusion_matrix(cm2, group_names=labels, cmap="Blues")
# Dropping the derived group columns since they duplicate information already
# present in Age, CreditScore and EstimatedSalary and will not play a part in the churn model
df.drop(["CreditScore_Grp", "Age_Grp", "EstimatedSalary_Grp"], axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CreditScore      10000 non-null  int64
 1   Geography        10000 non-null  category
 2   Gender           10000 non-null  category
 3   Age              10000 non-null  int64
 4   Tenure           10000 non-null  category
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  category
 7   HasCrCard        10000 non-null  category
 8   IsActiveMember   10000 non-null  category
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  category
dtypes: category(7), float64(2), int64(2)
memory usage: 382.2 KB
df.head()
| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 117130.115 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.860 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.800 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 123870.070 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.820 | 1 | 1 | 1 | 79084.10 | 0 |
X = df.drop(["Exited"], axis=1)
y = df["Exited"]
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6000, 10) (2000, 10) (2000, 10)
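The two-stage split sizes follow from the two `test_size` values: 25% of the remaining 80% is 20% of the original data, so the result is a 60/20/20 split. A quick sketch on synthetic data (the `demo_*` arrays are illustrative, with the same 80/20 class balance as the dataset) confirms both the sizes and the stratification:

```python
import numpy as np
from sklearn.model_selection import train_test_split

demo_X = np.arange(1000).reshape(-1, 1)
demo_y = np.array([0] * 800 + [1] * 200)  # 20% positives, like Exited

demo_X_temp, demo_X_test, demo_y_temp, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=1, stratify=demo_y
)
# 0.25 of the remaining 80% equals 20% of the original data
demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
    demo_X_temp, demo_y_temp, test_size=0.25, random_state=1, stratify=demo_y_temp
)
print(len(demo_X_train), len(demo_X_val), len(demo_X_test))  # 600 200 200
```

Stratifying both splits is what keeps the 0/1 proportions nearly identical across train, validation and test, as the class-percentage printout below shows for the real data.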
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6000 entries, 4472 to 29
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CreditScore      6000 non-null   int64
 1   Geography        6000 non-null   category
 2   Gender           6000 non-null   category
 3   Age              6000 non-null   int64
 4   Tenure           6000 non-null   category
 5   Balance          6000 non-null   float64
 6   NumOfProducts    6000 non-null   category
 7   HasCrCard        6000 non-null   category
 8   IsActiveMember   6000 non-null   category
 9   EstimatedSalary  6000 non-null   float64
dtypes: category(6), float64(2), int64(2)
memory usage: 270.6 KB
X_train.head(10)
| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary |
|---|---|---|---|---|---|---|---|---|---|---|
| 4472 | 660 | Germany | Female | 23 | 6 | 166070.480 | 2 | 0 | 0 | 90494.72 |
| 4034 | 601 | Spain | Female | 42 | 4 | 96763.890 | 1 | 1 | 1 | 199242.65 |
| 1454 | 521 | Spain | Female | 34 | 7 | 70731.070 | 1 | 1 | 1 | 20243.97 |
| 9099 | 738 | France | Male | 29 | 2 | 118627.160 | 2 | 1 | 1 | 170421.13 |
| 2489 | 714 | France | Male | 28 | 6 | 122724.370 | 1 | 1 | 1 | 67057.27 |
| 9615 | 692 | Spain | Female | 47 | 3 | 120539.815 | 2 | 1 | 0 | 150802.41 |
| 1452 | 687 | France | Female | 35 | 3 | 99587.430 | 1 | 1 | 1 | 1713.10 |
| 1515 | 850 | Spain | Male | 39 | 6 | 133214.130 | 1 | 0 | 1 | 20769.88 |
| 2086 | 725 | Spain | Female | 32 | 0 | 117307.470 | 2 | 1 | 1 | 138525.19 |
| 7243 | 634 | France | Male | 77 | 5 | 133086.130 | 2 | 1 | 1 | 161579.85 |
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
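One caveat with calling `pd.get_dummies` separately on each split: if a rare category were absent from the validation or test split, the dummy columns would no longer line up with the training columns. That does not bite here because every category appears in all three splits, but a defensive `reindex` keeps the columns consistent (the toy frames below are illustrative):

```python
import pandas as pd

# Two splits where a rare category ("Spain") is missing from the second
train = pd.DataFrame({"Geography": ["France", "Germany", "Spain"]})
test = pd.DataFrame({"Geography": ["France", "Germany"]})

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)
print(list(train_d.columns))  # ['Geography_Germany', 'Geography_Spain']
print(list(test_d.columns))   # ['Geography_Germany']

# Reindex the test frame to the training columns, filling missing dummies with 0
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))   # ['Geography_Germany', 'Geography_Spain']
```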
X_train
| | CreditScore | Age | Balance | EstimatedSalary | Geography_Germany | Geography_Spain | Gender_Male | Tenure_1 | Tenure_2 | Tenure_3 | Tenure_4 | Tenure_5 | Tenure_6 | Tenure_7 | Tenure_8 | Tenure_9 | Tenure_10 | NumOfProducts_2 | NumOfProducts_3 | NumOfProducts_4 | HasCrCard_1 | IsActiveMember_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4472 | 660 | 23 | 166070.48 | 90494.72 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4034 | 601 | 42 | 96763.89 | 199242.65 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1454 | 521 | 34 | 70731.07 | 20243.97 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 9099 | 738 | 29 | 118627.16 | 170421.13 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 2489 | 714 | 28 | 122724.37 | 67057.27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6444 | 693 | 37 | 95900.04 | 38196.24 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4806 | 697 | 33 | 87347.70 | 172524.51 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2622 | 627 | 27 | 185267.45 | 77027.34 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 203 | 711 | 38 | 129022.06 | 14374.86 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 29 | 411 | 29 | 59697.17 | 53483.21 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
6000 rows × 22 columns
## Scaling the data
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_val = sc.transform(X_val)
X_test = sc.transform(X_test)
X_train
array([[ 0.09158519, -1.51903358, 1.89430572, ..., -0.08396038,
-1.54528846, -1.04013825],
[-0.5163039 , 0.27998394, -0.92266913, ..., -0.08396038,
0.64712837, 0.96141066],
[-1.34056029, -0.47749712, -1.98077629, ..., -0.08396038,
0.64712837, 0.96141066],
...,
[-0.24842057, -1.14029305, 2.67456892, ..., -0.08396038,
0.64712837, 0.96141066],
[ 0.61704864, -0.09875659, 0.38846814, ..., -0.08396038,
0.64712837, 0.96141066],
[-2.47391282, -0.95092278, -2.42925051, ..., -0.08396038,
0.64712837, 0.96141066]])
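The scaler is fitted on the training data only and merely reused on the validation and test sets, so no information from those sets leaks into the scaling statistics. A small sketch of the pattern (the `demo_*` arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50, scale=10, size=(100, 3))
demo_test = rng.normal(loc=50, scale=10, size=(20, 3))

scaler = StandardScaler()
demo_train_s = scaler.fit_transform(demo_train)  # statistics learned from train only
demo_test_s = scaler.transform(demo_test)        # the same statistics reused

# Train columns now have mean ~0 and standard deviation ~1;
# test columns are close but not exact, since they reuse the train statistics
print(np.allclose(demo_train_s.mean(axis=0), 0))  # True
print(np.allclose(demo_train_s.std(axis=0), 1))   # True
```

Standardizing matters for the neural networks below: inputs on very different scales (CreditScore in the hundreds, Balance in the hundreds of thousands) would otherwise dominate the gradient updates.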
print("Shape of X Training set : ", X_train.shape)
print("Shape of X validation set : ", X_val.shape)
print("Shape of X test set : ", X_test.shape)
print("")
print("Shape of Y Training set : ", y_train.shape)
print("Shape of Y validation set : ", y_val.shape)
print("Shape of Y test set : ", y_test.shape)
print("")
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("")
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("")
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of X Training set :  (6000, 22)
Shape of X validation set :  (2000, 22)
Shape of X test set :  (2000, 22)

Shape of Y Training set :  (6000,)
Shape of Y validation set :  (2000,)
Shape of Y test set :  (2000,)

Percentage of classes in training set:
0    0.796333
1    0.203667
Name: Exited, dtype: float64

Percentage of classes in validation set:
0    0.796
1    0.204
Name: Exited, dtype: float64

Percentage of classes in test set:
0    0.7965
1    0.2035
Name: Exited, dtype: float64
backend.clear_session()
# Fixing the seed for the random number generators so that we receive the same output every time
np.random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding input layer with 64 neurons, relu as activation function.
model.add(Dense(64, activation="relu", input_shape=(22,)))
# Adding the first hidden layer with 32 neurons, relu as activation function
model.add(Dense(32, activation="relu"))
# Adding the second hidden layer with 8 neurons, relu as activation function
model.add(Dense(8, activation="relu"))
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss="binary_crossentropy",
metrics=["Recall"],
)
# Fitting the model on X_train and y_train with 50 epochs
history = model.fit(
    X_train, y_train, epochs=50, validation_data=(X_val, y_val), verbose=0
)
validate(model, history, True)
modelTrainDF_Model1 = model_performance_classification_sklearn_with_threshold(
    model, X_train, y_train
)
modelTestDF_Model1 = model_performance_classification_sklearn_with_threshold(
    model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 64) 1472
dense_1 (Dense) (None, 32) 2080
dense_2 (Dense) (None, 8) 264
dense_3 (Dense) (None, 1) 9
=================================================================
Total params: 3,825
Trainable params: 3,825
Non-trainable params: 0
_________________________________________________________________
188/188 [==============================] - 0s 2ms/step - loss: 0.1474 - recall: 0.8568
188/188 [==============================] - 0s 2ms/step
63/63 [==============================] - 0s 2ms/step - loss: 1.3340 - recall: 0.1838
63/63 [==============================] - 0s 1ms/step
Training Recall Score:  0.8567921440261865
Validation Recall Score:  0.18382352941176472
63/63 [==============================] - 0s 2ms/step - loss: 0.5733 - recall: 0.5135
63/63 [==============================] - 0s 1ms/step
Test Recall Score:  0.5135135135135135
188/188 [==============================] - 0s 2ms/step
63/63 [==============================] - 0s 2ms/step
Observations
- Training recall (0.86) is far higher than validation recall (0.18), so this model overfits heavily; test recall lands at 0.51.
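For reference, `validate` and `model_performance_classification_sklearn_with_threshold` are helper functions defined earlier in the notebook. A minimal sketch of what such a threshold-based scorer can look like (the function name `performance_at_threshold` and the exact column set are our assumptions, not the notebook's actual helper):

```python
import numpy as np
import pandas as pd

def performance_at_threshold(probs, y_true, threshold=0.5):
    """Score hard predictions obtained by thresholding predicted probabilities.
    (Illustrative sketch, not the notebook's actual helper.)"""
    y_pred = (np.asarray(probs).ravel() >= threshold).astype(int)
    y_true = np.asarray(y_true).ravel()
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = float(np.mean(y_pred == y_true))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return pd.DataFrame(
        {"Accuracy": [accuracy], "Recall": [recall], "Precision": [precision], "F1": [f1]}
    )
```

The notebook's helper presumably takes the fitted model itself and calls `model.predict` to obtain the probabilities before thresholding.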
backend.clear_session()
# Fixing the seed for random number generators so that we get the same output every time
np.random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding input layer with 128 neurons, relu as activation function
model.add(Dense(128, activation="relu", input_shape=(22,)))
# Adding the first hidden layer with 128 neurons, relu as activation function
model.add(Dense(128, activation="relu"))
# Adding hidden layer with 64 neurons, relu as activation function
model.add(Dense(64, activation="relu"))
# Adding hidden layer with 64 neurons, relu as activation function
model.add(Dense(64, activation="relu"))
# Adding hidden layer with 32 neurons, relu as activation function
model.add(Dense(32, activation="relu"))
# Adding another hidden layer with 32 neurons, relu as activation function
model.add(Dense(32, activation="relu"))
# Adding a hidden layer with 16 neurons, relu as activation function
model.add(Dense(16, activation="relu"))
# Adding another hidden layer with 16 neurons, relu as activation function
model.add(Dense(16, activation="relu"))
# Adding a hidden layer with 8 neurons, relu as activation function
model.add(Dense(8, activation="relu"))
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss="binary_crossentropy",
metrics=["Recall"],
)
# With Early Stopping
es_cb = callbacks.EarlyStopping(monitor="recall", min_delta=0.001, patience=5)
# Fitting the model on X_train and y_train for 150 epochs
history = model.fit(
X_train,
y_train,
epochs=150,
batch_size=700,
validation_data=(X_val, y_val),
callbacks=es_cb,
verbose=0,
)
validate(model, history, True)
modelTrainDF_Model2 = model_performance_classification_sklearn_with_threshold(
model, X_train, y_train
)
modelTestDF_Model2 = model_performance_classification_sklearn_with_threshold(
model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 2944
dense_1 (Dense) (None, 128) 16512
dense_2 (Dense) (None, 64) 8256
dense_3 (Dense) (None, 64) 4160
dense_4 (Dense) (None, 32) 2080
dense_5 (Dense) (None, 32) 1056
dense_6 (Dense) (None, 16) 528
dense_7 (Dense) (None, 16) 272
dense_8 (Dense) (None, 8) 136
dense_9 (Dense) (None, 1) 9
=================================================================
Total params: 35,953
Trainable params: 35,953
Non-trainable params: 0
_________________________________________________________________
Training   - loss: 0.3602 - recall: 0.4468
Validation - loss: 0.7020 - recall: 0.1275
Test       - loss: 0.3829 - recall: 0.4251
Training Recall Score: 0.44680851063829785
Validation Recall Score: 0.12745098039215685
Test Recall Score: 0.4250614250614251
Observations
- Recall is low everywhere (train 0.45, validation 0.13, test 0.43); the much deeper network with early stopping did not generalize better.
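The `EarlyStopping` callback above monitors the training `recall` metric with `min_delta=0.001` and `patience=5`, i.e. training stops once five consecutive epochs fail to improve recall by at least 0.001. The patience logic can be sketched in pure Python (a simplified approximation of `tf.keras.callbacks.EarlyStopping`, not its actual implementation):

```python
def early_stop_epoch(metric_history, min_delta=0.001, patience=5, mode="max"):
    """Return the 1-based epoch at which training would stop, or None if it never stops.
    Simplified approximation of Keras's EarlyStopping patience logic."""
    best = None
    wait = 0
    for epoch, value in enumerate(metric_history, start=1):
        improved = best is None or (
            value > best + min_delta if mode == "max" else value < best - min_delta
        )
        if improved:
            best = value
            wait = 0          # reset the patience counter on improvement
        else:
            wait += 1
            if wait >= patience:
                return epoch  # patience exhausted: stop here
    return None
```

Note that monitoring the *training* recall, as the cell above does, can keep training going even while validation recall stagnates; monitoring a validation metric (e.g. `monitor="val_loss"`, as Model 7 later does) is the more common choice.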
backend.clear_session()
# Fixing the seed for random number generators so that we get the same output every time
np.random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding the input layer with 128 neurons, relu activation, and he_normal weight initialization
model.add(Dense(128, activation="relu", kernel_initializer="he_normal", input_shape=(22,)))
# Adding a hidden layer with 64 neurons, relu activation, and he_normal weight initialization
model.add(Dense(64, activation="relu", kernel_initializer="he_normal"))
# Adding a hidden layer with 32 neurons, relu activation, and he_normal weight initialization
model.add(Dense(32, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding a hidden layer with 16 neurons, relu activation, and he_normal weight initialization
model.add(Dense(16, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 8 neurons, relu activation, and he_normal weight initialization
model.add(Dense(8, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
loss="binary_crossentropy",
metrics=["Recall"],
)
# With Early Stopping
es_cb = callbacks.EarlyStopping(monitor="recall", min_delta=0.001, patience=5)
# Fitting the model on X_train and y_train for 150 epochs
history = model.fit(
X_train,
y_train,
epochs=150,
batch_size=700,
validation_data=(X_val, y_val),
verbose=0,
callbacks=es_cb
)
validate(model, history, True)
modelTrainDF_Model3 = model_performance_classification_sklearn_with_threshold(
model, X_train, y_train
)
modelTestDF_Model3 = model_performance_classification_sklearn_with_threshold(
model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 2944
dense_1 (Dense) (None, 64) 8256
dense_2 (Dense) (None, 32) 2080
batch_normalization (BatchNormalization) (None, 32) 128
dense_3 (Dense) (None, 16) 528
batch_normalization_1 (BatchNormalization) (None, 16) 64
dropout (Dropout) (None, 16) 0
batch_normalization_2 (BatchNormalization) (None, 16) 64
dense_4 (Dense) (None, 8) 136
batch_normalization_3 (BatchNormalization) (None, 8) 32
dense_5 (Dense) (None, 1) 9
=================================================================
Total params: 14,241
Trainable params: 14,097
Non-trainable params: 144
_________________________________________________________________
Training   - loss: 0.4480 - recall: 0.2774
Validation - loss: 0.5787 - recall: 0.0784
Test       - loss: 0.4692 - recall: 0.2359
Training Recall Score: 0.2774140752864157
Validation Recall Score: 0.0784313725490196
Test Recall Score: 0.23587223587223588
Observations
- Recall drops further (train 0.28, validation 0.08, test 0.24); the SGD model with BatchNormalization and Dropout is underfitting the positive class.
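`Dropout(0.2)` randomly zeroes 20% of the incoming activations during training and rescales the survivors so the expected activation is unchanged; at inference time it is a no-op. A NumPy sketch of this "inverted dropout" behavior (illustrative only, not Keras's actual implementation):

```python
import numpy as np

def dropout_train(x, rate=0.2, rng=None):
    """Inverted dropout: zero a fraction `rate` of units, rescale survivors by 1/(1-rate)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    keep = (rng.random(x.shape) >= rate).astype(x.dtype)
    return x * keep / (1.0 - rate)

def dropout_infer(x):
    """At inference time dropout is the identity; the training-time rescaling
    already made the expected activations match."""
    return x
```

Because survivors are scaled by 1/0.8 = 1.25, the mean activation is preserved between training and inference.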
backend.clear_session()
# Fixing the seed for random number generators so that we get the same output every time
np.random.seed(1)
import random
random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding the input layer with 128 neurons, tanh activation, and he_uniform weight initialization
model.add(
    Dense(128, activation="tanh", kernel_initializer="he_uniform", input_shape=(22,))
)
# Adding the first hidden layer with 64 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(64, activation="tanh", kernel_initializer="he_uniform"))
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 32 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(32, activation="tanh", kernel_initializer="he_uniform"))
model.add(BatchNormalization())
# Adding another hidden layer with 32 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(32, activation="tanh", kernel_initializer="he_uniform"))
model.add(BatchNormalization())
# Adding a hidden layer with 16 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(16, activation="tanh", kernel_initializer="he_uniform"))
model.add(BatchNormalization())
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
loss="binary_crossentropy",
metrics=["Recall"],
)
# Fitting the model on X_train and y_train for 150 epochs
history = model.fit(
X_train,
y_train,
epochs=150,
batch_size=700,
validation_data=(X_val, y_val),
verbose=0,
)
validate(model, history, True)
modelTrainDF_Model4 = model_performance_classification_sklearn_with_threshold(
model, X_train, y_train
)
modelTestDF_Model4 = model_performance_classification_sklearn_with_threshold(
model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 2944
dense_1 (Dense) (None, 64) 8256
dropout (Dropout) (None, 64) 0
batch_normalization (BatchNormalization) (None, 64) 256
dense_2 (Dense) (None, 32) 2080
batch_normalization_1 (BatchNormalization) (None, 32) 128
dense_3 (Dense) (None, 32) 1056
batch_normalization_2 (BatchNormalization) (None, 32) 128
dense_4 (Dense) (None, 16) 528
batch_normalization_3 (BatchNormalization) (None, 16) 64
dense_5 (Dense) (None, 1) 17
=================================================================
Total params: 15,457
Trainable params: 15,169
Non-trainable params: 288
_________________________________________________________________
Training   - loss: 0.3078 - recall: 0.5565
Validation - loss: 0.7482 - recall: 0.1201
Test       - loss: 0.3807 - recall: 0.4521
Training Recall Score: 0.5564648117839607
Validation Recall Score: 0.12009803921568628
Test Recall Score: 0.4520884520884521
Observations
- The train-validation gap (0.56 vs. 0.12) still indicates overfitting, but test recall (0.45) improves over the previous model.
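The `Param #` column in these summaries follows directly from the layer shapes: a `Dense` layer learns `n_out * (n_in + 1)` parameters (a weight matrix plus biases), and `BatchNormalization` keeps four values per unit (gamma and beta are trainable; the moving mean and variance are not, which is where the non-trainable counts come from). A quick sketch of the counting rules (the helper names are ours):

```python
def dense_params(n_in, n_out):
    """A Dense layer learns an (n_in x n_out) weight matrix plus n_out biases."""
    return n_out * (n_in + 1)

def batchnorm_params(n_units):
    """BatchNormalization keeps gamma and beta (trainable) plus the moving
    mean and variance (non-trainable): 4 values per unit."""
    return 4 * n_units
```

For example, `dense_params(22, 128)` reproduces the 2944 shown for the first layer, and `batchnorm_params(64)` the 256 shown for the first BatchNormalization layer above.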
backend.clear_session()
# Fixing the seed for random number generators so that we get the same output every time
np.random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding the input layer with 128 neurons, relu activation, and he_normal weight initialization
model.add(Dense(128, activation="relu", kernel_initializer="he_normal", input_shape=(22,)))
# Adding a hidden layer with 64 neurons, relu activation, and he_normal weight initialization
model.add(Dense(64, activation="relu", kernel_initializer="he_normal"))
# Adding a hidden layer with 32 neurons, relu activation, and he_normal weight initialization
model.add(Dense(32, activation="relu", kernel_initializer="he_normal"))
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 16 neurons, relu activation, and he_normal weight initialization
model.add(Dense(16, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 8 neurons, relu activation, and he_normal weight initialization
model.add(Dense(8, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.1),
loss="binary_crossentropy",
metrics=["Recall"],
)
# Fitting the model on X_train and y_train for 150 epochs
history = model.fit(
X_train,
y_train,
epochs=150,
batch_size=700,
validation_data=(X_val, y_val),
verbose=0,
)
validate(model, history, True)
modelTrainDF_Model5 = model_performance_classification_sklearn_with_threshold(
model, X_train, y_train
)
modelTestDF_Model5 = model_performance_classification_sklearn_with_threshold(
model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 2944
dense_1 (Dense) (None, 64) 8256
dense_2 (Dense) (None, 32) 2080
dropout (Dropout) (None, 32) 0
batch_normalization (BatchNormalization) (None, 32) 128
dense_3 (Dense) (None, 16) 528
batch_normalization_1 (BatchNormalization) (None, 16) 64
dropout_1 (Dropout) (None, 16) 0
batch_normalization_2 (BatchNormalization) (None, 16) 64
dense_4 (Dense) (None, 8) 136
batch_normalization_3 (BatchNormalization) (None, 8) 32
dense_5 (Dense) (None, 1) 9
=================================================================
Total params: 14,241
Trainable params: 14,097
Non-trainable params: 144
_________________________________________________________________
Training   - loss: 0.0428 - recall: 0.9722
Validation - loss: 3.4411 - recall: 0.2279
Test       - loss: 1.4872 - recall: 0.5823
Training Recall Score: 0.972176759410802
Validation Recall Score: 0.22794117647058823
Test Recall Score: 0.5823095823095823
Observations
- With a 0.1 learning rate, RMSprop pushes training recall to 0.97 while validation recall stays at 0.23; despite the best test recall so far (0.58), this model overfits severely.
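The `loss` values in these logs (training 0.0428 versus validation 3.4411 here) are binary cross-entropy, matching the `loss="binary_crossentropy"` setting; the huge gap is another view of the overfitting. A NumPy sketch of the formula (the clipping constant is a common numerical-safety convention, not taken from Keras):

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with probabilities clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))
```

Confident correct predictions drive the loss toward zero, while confident wrong ones are penalized heavily, which is why a model that memorizes the training set can show a tiny training loss and a very large validation loss.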
backend.clear_session()
# Fixing the seed for random number generators so that we get the same output every time
np.random.seed(1)
import random
random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding the input layer with 128 neurons, tanh activation, and he_uniform weight initialization
model.add(
    Dense(128, activation="tanh", kernel_initializer="he_uniform", input_shape=(22,))
)
# Adding the first hidden layer with 64 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(64, activation="tanh", kernel_initializer="he_uniform"))
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 32 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(32, activation="tanh", kernel_initializer="he_uniform"))
model.add(BatchNormalization())
# Adding another hidden layer with 32 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(32, activation="tanh", kernel_initializer="he_uniform"))
model.add(BatchNormalization())
# Adding a hidden layer with 16 neurons, tanh activation, and he_uniform weight initialization
model.add(Dense(16, activation="tanh", kernel_initializer="he_uniform"))
model.add(BatchNormalization())
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
loss="binary_crossentropy",
metrics=["Recall"],
)
# Fitting the model on X_train and y_train for 150 epochs
history = model.fit(
X_train,
y_train,
epochs=150,
batch_size=700,
validation_data=(X_val, y_val),
verbose=0,
)
validate(model, history, True)
modelTrainDF_Model6 = model_performance_classification_sklearn_with_threshold(
model, X_train, y_train
)
modelTestDF_Model6 = model_performance_classification_sklearn_with_threshold(
model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 2944
dense_1 (Dense) (None, 64) 8256
dropout (Dropout) (None, 64) 0
batch_normalization (BatchNormalization) (None, 64) 256
dense_2 (Dense) (None, 32) 2080
batch_normalization_1 (BatchNormalization) (None, 32) 128
dense_3 (Dense) (None, 32) 1056
batch_normalization_2 (BatchNormalization) (None, 32) 128
dense_4 (Dense) (None, 16) 528
batch_normalization_3 (BatchNormalization) (None, 16) 64
dense_5 (Dense) (None, 1) 17
=================================================================
Total params: 15,457
Trainable params: 15,169
Non-trainable params: 288
_________________________________________________________________
Training   - loss: 0.0201 - recall: 0.9812
Validation - loss: 2.0669 - recall: 0.1912
Test       - loss: 0.9300 - recall: 0.5037
Training Recall Score: 0.9811783960720131
Validation Recall Score: 0.19117647058823528
Test Recall Score: 0.5036855036855037
Observations
- Training recall (0.98) against validation recall (0.19) again shows severe overfitting; test recall is 0.50.
backend.clear_session()
# Fixing the seed for random number generators so that we get the same output every time
np.random.seed(1)
import random
random.seed(1)
tf.random.set_seed(1)
# Initializing the model
model = Sequential()
# Adding the input layer with 128 neurons, relu activation, and he_normal weight initialization
model.add(
    Dense(128, activation="relu", kernel_initializer="he_normal", input_shape=(22,))
)
# Adding the first hidden layer with 64 neurons, relu activation, and he_normal weight initialization
model.add(Dense(64, activation="relu", kernel_initializer="he_normal"))
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 32 neurons, relu activation, and he_normal weight initialization
model.add(Dense(32, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding another hidden layer with 32 neurons, relu activation, and he_normal weight initialization
model.add(Dense(32, activation="relu", kernel_initializer="he_normal"))
# Adding Dropout with a 20% rate
model.add(Dropout(0.2))
model.add(BatchNormalization())
# Adding a hidden layer with 16 neurons, relu activation, and he_normal weight initialization
model.add(Dense(16, activation="relu", kernel_initializer="he_normal"))
model.add(BatchNormalization())
# Adding the output layer with one neuron and sigmoid as activation
model.add(Dense(1, activation="sigmoid"))
model.summary()
# Compile the model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss="binary_crossentropy",
metrics=["Recall"],
)
# With Early Stopping
es_cb = callbacks.EarlyStopping(monitor="val_loss", min_delta=0.001, patience=5)
# Fitting the model on X_train and y_train for 300 epochs
history = model.fit(
X_train,
y_train,
epochs=300,
batch_size=50,
validation_data=(X_val, y_val),
callbacks=es_cb,
verbose=0,
)
validate(model, history, True)
modelTrainDF_Model7 = model_performance_classification_sklearn_with_threshold(
model, X_train, y_train
)
modelTestDF_Model7 = model_performance_classification_sklearn_with_threshold(
model, X_test, y_test
)
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 128) 2944
dense_1 (Dense) (None, 64) 8256
dropout (Dropout) (None, 64) 0
batch_normalization (BatchNormalization) (None, 64) 256
dense_2 (Dense) (None, 32) 2080
batch_normalization_1 (BatchNormalization) (None, 32) 128
dense_3 (Dense) (None, 32) 1056
dropout_1 (Dropout) (None, 32) 0
batch_normalization_2 (BatchNormalization) (None, 32) 128
dense_4 (Dense) (None, 16) 528
batch_normalization_3 (BatchNormalization) (None, 16) 64
dense_5 (Dense) (None, 1) 17
=================================================================
Total params: 15,457
Trainable params: 15,169
Non-trainable params: 288
_________________________________________________________________
Training   - loss: 0.3331 - recall: 0.4591
Validation - loss: 0.7137 - recall: 0.1054
Test       - loss: 0.3801 - recall: 0.4103
Training Recall Score: 0.4590834697217676
Validation Recall Score: 0.1053921568627451
Test Recall Score: 0.4103194103194103
Observations
- Recall is moderate (train 0.46, test 0.41) with a small train-test gap, and test precision (0.74) is the highest so far.
# training performance comparison
models_train_comp_df = pd.concat(
[
modelTrainDF_Model1.T,
modelTrainDF_Model2.T,
modelTrainDF_Model3.T,
modelTrainDF_Model4.T,
modelTrainDF_Model5.T,
modelTrainDF_Model6.T,
modelTrainDF_Model7.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Model 1 - Less layers with Adam",
"Model 2 - With Early Stopping Tuning",
"Model 3 - SGD with Relu Function, Dropout & BatchNorm. Tuning",
"Model 4 - Adagrad with TanH Function, Dropout & BatchNorm. Tuning",
"Model 5 - RMSProp with ReLu Function",
"Model 6 - Adam with TanH Function, Dropout & BatchNorm. Tuning",
"Model 7 - Adam with ReLu Function, Dropout & BatchNorm. Tuning¶",
]
print("Training performance comparison:")
print(models_train_comp_df.T)
# Testing performance comparison
models_test_comp_df = pd.concat(
[
modelTestDF_Model1.T,
modelTestDF_Model2.T,
modelTestDF_Model3.T,
modelTestDF_Model4.T,
modelTestDF_Model5.T,
modelTestDF_Model6.T,
modelTestDF_Model7.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Model 1 - Less layers with Adam",
"Model 2 - With Early Stopping Tuning",
"Model 3 - SGD with Relu Function, Dropout & BatchNorm. Tuning",
"Model 4 - Adagrad with TanH Function, Dropout & BatchNorm. Tuning",
"Model 5 - RMSProp with ReLu Function",
"Model 6 - Adam with TanH Function, Dropout & BatchNorm. Tuning",
"Model 7 - Adam with ReLu Function, Dropout & BatchNorm. Tuning¶",
]
print("\n\n")
print("Testing performance comparison:")
print(models_test_comp_df.T)
Training performance comparison:
                                                                   Accuracy    Recall  Precision        F1
Model 1 - Less layers with Adam                                    0.943000  0.856792   0.862438  0.859606
Model 2 - With Early Stopping Tuning                               0.847000  0.446809   0.692893  0.543284
Model 3 - SGD with Relu Function, Dropout & BatchNorm. Tuning      0.830667  0.277414   0.718220  0.400236
Model 4 - Adagrad with TanH Function, Dropout & BatchNorm. Tuning  0.877000  0.556465   0.776256  0.648236
Model 5 - RMSProp with ReLu Function                               0.984333  0.972177   0.951923  0.961943
Model 6 - Adam with TanH Function, Dropout & BatchNorm. Tuning     0.994167  0.981178   0.990091  0.985614
Model 7 - Adam with ReLu Function, Dropout & BatchNorm. Tuning     0.862333  0.459083   0.772727  0.575975
Testing performance comparison:
                                                                   Accuracy    Recall  Precision        F1
Model 1 - Less layers with Adam                                      0.8100  0.513514   0.534527  0.523810
Model 2 - With Early Stopping Tuning                                 0.8390  0.425061   0.662835  0.517964
Model 3 - SGD with Relu Function, Dropout & BatchNorm. Tuning        0.8195  0.235872   0.657534  0.347197
Model 4 - Adagrad with TanH Function, Dropout & BatchNorm. Tuning    0.8470  0.452088   0.689139  0.545994
Model 5 - RMSProp with ReLu Function                                 0.8015  0.582310   0.510776  0.544202
Model 6 - Adam with TanH Function, Dropout & BatchNorm. Tuning       0.8170  0.503686   0.555556  0.528351
Model 7 - Adam with ReLu Function, Dropout & BatchNorm. Tuning       0.8510  0.410319   0.742222  0.528481
Comparing the performance of the models, Model 4 offers the best combination of test recall, accuracy, and F1 score with comparatively little overfitting, followed by Model 7. Model 6 also posts strong training scores, but it overfits badly.
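As a cross-check of this conclusion, ranking the test F1 scores copied from the comparison table above confirms that Model 4 leads, with Model 5 and Model 7 close behind (a small pandas sketch, using values transcribed from the table):

```python
import pandas as pd

# Test F1 scores copied from the testing performance comparison above
test_f1 = pd.Series(
    {
        "Model 1": 0.523810,
        "Model 2": 0.517964,
        "Model 3": 0.347197,
        "Model 4": 0.545994,
        "Model 5": 0.544202,
        "Model 6": 0.528351,
        "Model 7": 0.528481,
    }
)
ranked = test_f1.sort_values(ascending=False)
print(ranked.head(3))  # Model 4 first, then Model 5 and Model 7
```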
Based on the Customer Information:
From the customers' data patterns, we derived the following insights, which can be leveraged as recommendations for understanding the customers: